Abstract. Parsing is an important problem in computer science and yet surprisingly little
attention has been devoted to its formal verification. In this paper, we present TRX:
a parser interpreter formally developed in the proof assistant Coq, capable of producing
formally correct parsers. We use parsing expression grammars (PEGs), a formalism
essentially representing recursive descent parsing, which we consider an attractive
alternative to context-free grammars (CFGs). From this formalization we can extract a parser
for an arbitrary PEG grammar with the guarantee of total correctness, i.e., the resulting
parser is terminating and correct with respect to its grammar and the semantics of PEGs;
both properties are formally proven in Coq.



1. Introduction 

Parsing is of major interest in computer science. Classically discovered by students as 
the first step in compilation, parsing is present in almost every program which performs 
data-manipulation. 

For instance, the Web is built on parsers. The HyperText Transfer Protocol (HTTP) 
is a parsed dialog between the client, or browser, and the server. This protocol transfers 
pages in HyperText Markup Language (HTML), which is also parsed by the browser. When 
running web-applications, browsers interpret JavaScript programs which, again, begins with 
parsing. Data exchange between browser(s) and server(s) uses languages or formats like 
XML and JSON. Even inside the server, several components (for instance the trio made of 
the HTTP server Apache, the PHP interpreter and the MySQL database) often manipulate 
programs and data dynamically; all require parsers. 

Parsing is not limited to compilation or the Web: securing data flow entering a network, 
signaling mobile communications, and manipulating domain specific languages (DSL) all 
require a variety of parsers. 

The most common approach to parsing is by means of parser generators, which take 
as input a grammar of some language and generate the source code of a parser for that 
language. They are usually based on regular expressions (REs) and context-free grammars
(CFGs), the latter expressed in Backus-Naur Form (BNF) syntax. They typically are able to
deal with some subclass of context-free languages, the popular subclasses including LL(k), 
LR(k) and LALR(k) grammars. Such grammars are usually augmented with semantic 
actions that are used to produce a parse tree or an abstract syntax tree (AST) of the input. 

What about correctness of such parsers? Yacc is the most widely used parser generator
and a mature program, and yet the reference book about this tool [LMB92] devotes a whole
section ("Bugs in Yacc") to discussing common bugs in its distributions. Furthermore, the
code generated by such tools often contains huge parsing tables, making manual inspection
and/or verification nearly impossible. The recent article about CompCert [Ler09],
an impressive project formally verifying a compiler for a large subset of C, opens its
introduction with the question "Can you trust your compiler?". Nevertheless, the formal verification
starts at the level of the AST and does not concern the parser [Ler09, Figure 1]. Can you
trust your parser?

Parsing expression grammars (PEGs) [For04] are an alternative to CFGs that have
recently been gaining popularity. In contrast to CFGs they are unambiguous and allow
easy integration of lexical analysis into the parsing phase. Their implementation is easy, as
PEGs are essentially a declarative way of specifying recursive descent parsers [Bur75]. With
their backtracking and unlimited look-ahead capabilities they are expressive enough to cover
all LL(k) and LR(k) languages as well as some non-context-free ones. However, recursive
descent parsing of grammars that are not LL(k) may require exponential time. A solution
to that problem is to use memoization, giving rise to packrat parsing and ensuring linear
time complexity at the price of higher memory consumption [AU72, For02b, For02a]. It is
not easy to support (indirect) left-recursive rules in PEGs, as they lead to non-terminating
parsers [WDM08].

In this paper we present TRX: a PEG-based parser interpreter formally developed in
the proof assistant Coq [Coq, BC04]. As a result, expressing a grammar in Coq allows
one, via its extraction capabilities [Let08], to obtain a parser for this grammar with total
correctness guarantees. That means that the resulting parser is terminating and correct
with respect to its grammar and the semantics of PEGs; both of those properties are formally
proved in Coq. Moreover, every definition and theorem presented in this paper has been
expressed and verified in Coq.

Our emphasis is on the practicality of such a tool. We perform two case studies: on 
a simple XML format but also on the full grammar of the Java language. We present 
benchmarks indicating that the performance of obtained parsers is reasonable. We also 
sketch ideas on how it can be improved further, as well as how TRX could be extended into 
a tool of its own, freeing its users from any kind of interaction with Coq and broadening its 
applicability. 

This work was carried out in the context of improving safety and security of OPA (One 
Pot Application): an integrated platform for web development [RTS]. As mentioned above,
parsing is of utmost importance for web-applications and TRX is one of the components
in the OPA platform. 

The remainder of this paper is organized as follows. We introduce PEGs in Section 2
and in Section 3 we extend them with semantic actions. Section 4 describes a method
for checking that there is no (indirect) left recursion in a grammar, a result ensuring that
parsing will terminate. Section 5 reports on our experience with putting the ideas of the
preceding sections into practice and implementing a formally correct parser interpreter
in Coq. Section 6 is devoted to a practical evaluation of this interpreter and contains case
studies of extracting XML and Java parsers from it, presenting a benchmark of TRX against
other parser generators and giving an account of our experience with extraction. We discuss
related work in Section 7, present ideas for extensions and future work in Section 8 and
conclude in Section 9.

Figure 1: Parsing expressions



2. Parsing Expression Grammars (PEGs) 

The content of this section is a different presentation of the results by Ford [For04]. For
more details we refer to the original article. For a general overview of parsing we refer to,
for instance, Aho, Sethi & Ullman [ASU86].

PEGs are a formalism for parsing that is an interesting alternative to CFGs. We will
formally introduce them along with their semantics in Section 2.1. PEGs have recently been
gaining popularity due to their ease of implementation and some general desirable properties
that we will sketch in Section 2.2 while comparing them to CFGs.



2.1. Definition of PEGs. 

Definition 2.1 (Parsing expressions). We introduce a set of parsing expressions, $\Delta$, over a
finite set of terminals $V_T$ and a finite set of non-terminals $V_N$. We denote the set of strings
as $\mathcal{S}$, and a string $s \in \mathcal{S}$ is a list of terminals from $V_T$. The inductive definition of $\Delta$ is
given in Figure 1.

Later on we will present the formal semantics but for now we informally describe the 
language expressed by such parsing expressions. 

• Empty expression $\varepsilon$ always succeeds without consuming any input.

• Any-character $[\cdot]$, a terminal $[a]$ and a range $[a-z]$ all consume a single terminal from
the input but they expect it to be, respectively: an arbitrary terminal, precisely $a$, and a
terminal in the range between $a$ and $z$.

• Literal $["s"]$ reads a string (i.e., a sequence of terminals) $s$ from the input.

• Parsing a non-terminal $A$ amounts to parsing the expression defining $A$.

• A sequence $e_1; e_2$ expects an input conforming to $e_1$ followed by an input conforming to
$e_2$.

• A choice $e_1 / e_2$ expresses a prioritized choice between $e_1$ and $e_2$. This means that $e_2$ will
be tried only if $e_1$ fails.

• A zero-or-more (resp. one-or-more) repetition $e*$ (resp. $e+$) consumes zero-or-more (resp.
one-or-more) repetitions of $e$ from the input. Those operators are greedy, i.e., the longest
match in the input, conforming to $e$, will be consumed.

• An and-predicate (resp. not-predicate) $\&e$ (resp. $!e$) succeeds only if the input conforms
to $e$ (resp. does not conform to $e$) but does not consume any input.

Figure 2: Formal semantics of PEGs
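The informal description above can be read as a recursive-descent recognizer. The following Python sketch is an illustration only: the tagged-tuple encoding of expressions and the function name `parse` are assumptions made for this example and are not part of TRX. It covers the core operators (empty, any-character, terminal, non-terminal, sequence, prioritized choice, greedy repetition and not-predicate) and returns the unconsumed suffix on success or `None` on failure.

```python
from typing import Dict, Optional

# Hypothetical encoding of parsing expressions as tagged tuples:
#   ("empty",)       ("any",)          ("char", a)        ("nt", name)
#   ("seq", e1, e2)  ("choice", e1, e2)  ("star", e)      ("not", e)

def parse(grammar: Dict[str, tuple], e: tuple, s: str) -> Optional[str]:
    """Return the unconsumed suffix of s on success, or None on failure."""
    tag = e[0]
    if tag == "empty":                    # always succeeds, consumes nothing
        return s
    if tag == "any":                      # an arbitrary single terminal
        return s[1:] if s else None
    if tag == "char":                     # precisely the terminal e[1]
        return s[1:] if s[:1] == e[1] else None
    if tag == "nt":                       # parse the expression defining e[1]
        return parse(grammar, grammar[e[1]], s)
    if tag == "seq":                      # e1, then e2 on what remains
        rest = parse(grammar, e[1], s)
        return None if rest is None else parse(grammar, e[2], rest)
    if tag == "choice":                   # prioritized: e2 only if e1 fails
        rest = parse(grammar, e[1], s)
        return rest if rest is not None else parse(grammar, e[2], s)
    if tag == "star":                     # greedy repetition
        while True:
            rest = parse(grammar, e[1], s)
            if rest is None:
                return s
            if rest == s:
                # The body succeeded without consuming input, so a naive
                # loop would never stop; this sketch reports it instead.
                raise ValueError("repetition body consumed no input")
            s = rest
    if tag == "not":                      # succeeds iff e fails; consumes nothing
        return s if parse(grammar, e[1], s) is None else None
    raise ValueError(f"unknown expression: {e!r}")
```

Note that the choice never revisits its decision: once the first alternative succeeds, the second one is not tried even if the rest of the sequence fails later.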

We now define PEGs, which are essentially a finite set of non-terminals, also referred to as 
productions, with their corresponding parsing expressions. 

Definition 2.2 (Parsing Expressions Grammar (PEG)). A parsing expressions grammar
(PEG), $\mathcal{G}$, is a tuple $(V_T, V_N, P_{exp}, v_{start})$, where:

• $V_T$ is a finite set of terminals,

• $V_N$ is a finite set of non-terminals,

• $P_{exp}$ is the interpretation of the productions, i.e., $P_{exp} : V_N \to \Delta$, and

• $v_{start}$ is the start production, $v_{start} \in V_N$.

We will now present the formal semantics of PEGs. The semantics is given by means
of tuples $(e, s) \overset{m}{\leadsto} r$, which indicate that parsing expression $e \in \Delta$ applied on a string $s \in \mathcal{S}$
gives, in $m$ steps, the result $r$, where $r$ is either $\bot$, denoting that parsing failed, or $\surd_{s'}$,
indicating that parsing succeeded and $s'$ is what remains to be parsed. We will drop the $m$
annotation whenever irrelevant.

The complete semantics is presented in Figure 2. Please note that the following operators
from Definition 2.1 can be derived and therefore are not included in the semantics:

$[a-z] ::= [a] / \ldots / [z]$    $e+ ::= e; e*$    $\&e ::= !!e$
$["s"] ::= [s_0]; \ldots; [s_n]$    $e? ::= e / \varepsilon$
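The derived operators can be sketched as plain desugaring functions that build core expressions. The Python below is an illustration only; the tagged-tuple encoding and the helper names (`char_range`, `literal`, `plus`, `and_pred`, `optional`) are assumptions of this example and do not come from the TRX development.

```python
from functools import reduce

def seq(e1, e2):
    return ("seq", e1, e2)

def choice(e1, e2):
    return ("choice", e1, e2)

def char_range(a, z):
    """[a-z] ::= [a] / ... / [z]   (assumes a <= z)."""
    return reduce(choice, [("char", chr(c)) for c in range(ord(a), ord(z) + 1)])

def literal(s):
    """["s"] ::= [s0]; ...; [sn]   (the empty literal is the empty expression)."""
    return reduce(seq, [("char", c) for c in s]) if s else ("empty",)

def plus(e):
    """e+ ::= e; e*"""
    return seq(e, ("star", e))

def and_pred(e):
    """&e ::= !!e"""
    return ("not", ("not", e))

def optional(e):
    """e? ::= e / empty"""
    return choice(e, ("empty",))
```

Expanding a range into a chain of choices is faithful to the definition but costs one alternative per character, which is one reason an implementation may prefer to treat ranges as a primitive.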



2.2. CFGs vs PEGs. The main differences between PEGs and CFGs are the following: 

• the choice operator, $e_1 / e_2$, is prioritized, i.e., $e_2$ is tried only if $e_1$ fails;

• the repetition operators, $e*$ and $e+$, are greedy, which makes it easy to express
"longest-match" parsing, which is almost always desired;

• syntactic predicates [PQ94], $\&e$ and $!e$, both of which consume no input and succeed if $e$,
respectively, succeeds or fails. This effectively provides an unlimited look-ahead and, in
combination with choice, limited backtracking capabilities.

An important consequence of the choice and repetition operators being deterministic (choice
being prioritized and repetition greedy) is the fact that PEGs are unambiguous. We will
see a formal proof of that in Theorem 3.5. This makes them unfit for processing natural
languages, but is a much desired property when it comes to grammars for programming
languages.

Another important consequence is ease of implementation. Efficient algorithms are
known only for certain subclasses of CFGs and they tend to be rather complicated. PEGs are
essentially a declarative way of specifying recursive descent parsers [Bur75] and performing
this type of parsing for PEGs is straightforward (more on that in Section 5). By using the
technique of packrat parsing [AU72, For02b], i.e., essentially adding memoization to the
recursive descent parser, one obtains parsers with linear time complexity guarantees. The
downside of this approach is high memory requirements: the worst-case space complexity
of PEG parsing is linear in the size of the input, but with packrat parsing the constant of
this linear relation can be very high. For instance, Ford reports a factor of around 700 for a
parser of Java [For02b].
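To make the idea of packrat parsing concrete, here is a minimal Python sketch that memoizes the result of every (non-terminal, position) pair. It is an illustration under stated assumptions: the tagged-tuple encoding, the function `make_packrat` and the call counter are hypothetical names for this example, and the sketch is a recognizer only (no semantic values, no repetition or predicates).

```python
from typing import Dict, Optional, Tuple

def make_packrat(grammar: Dict[str, tuple], text: str):
    """Build a recognizer over positions in text; non-terminals are memoized."""
    memo: Dict[Tuple[str, int], Optional[int]] = {}
    calls = {"count": 0}          # number of non-terminal bodies actually evaluated

    def run(e: tuple, i: int) -> Optional[int]:
        """Return the position after the match, or None on failure."""
        tag = e[0]
        if tag == "empty":
            return i
        if tag == "char":
            return i + 1 if text[i:i + 1] == e[1] else None
        if tag == "seq":
            j = run(e[1], i)
            return None if j is None else run(e[2], j)
        if tag == "choice":
            j = run(e[1], i)
            return j if j is not None else run(e[2], i)
        if tag == "nt":
            key = (e[1], i)
            if key not in memo:   # each (non-terminal, position) is parsed once
                calls["count"] += 1
                memo[key] = run(grammar[e[1]], i)
            return memo[key]
        raise ValueError(f"unknown expression: {e!r}")

    return run, calls

# A grammar in which the second alternative re-parses what the first one tried.
GRAMMAR = {
    "expr": ("choice", ("seq", ("nt", "factor"), ("seq", ("char", "+"), ("nt", "expr"))),
             ("nt", "factor")),
    "factor": ("choice", ("seq", ("nt", "term"), ("seq", ("char", "*"), ("nt", "factor"))),
               ("nt", "term")),
    "term": ("choice", ("seq", ("char", "("), ("seq", ("nt", "expr"), ("char", ")"))),
             ("char", "n")),
}
```

With the table in place the number of evaluated non-terminal bodies is bounded by the number of non-terminals times the number of input positions, at the cost of keeping one table entry per such pair.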



CFGs work hand-in-hand with REs. The lexical analysis, breaking up the input into 
tokens, is performed with REs. Such tokens are subject to syntactical analysis, which is 
executed with CFGs. This split into two phases is not necessary with PEGs, as they make 
it possible to easily express both lexical and syntactical rules with a single formalism. We 
will see that in the following example. 

Example 2.3 (PEG for simple mathematical expressions). Consider a PEG for simple
mathematical expressions over 5 non-terminals: $V_N ::= \{ws, number, term, factor, expr\}$
with the following productions ($P_{exp}$ function from Definition 2.2):

ws     ::= ([ ] / [\t])*
number ::= [0-9]+
term   ::= ws number ws / ws [(] expr [)] ws
factor ::= term [*] factor / term
expr   ::= factor [+] expr / factor

Please note that in this and all the following examples we write the sequence operator $e_1; e_2$
implicitly as $e_1\ e_2$. The starting production is $v_{start} ::= expr$.

First, let us note that lexical analysis is incorporated into this grammar by means 
of the ws production which consumes all white-space from the beginning of the input. 
Allowing white-space between "tokens" of the grammar comes down to placing the call to 
this production around the terminals of the grammar. If one does not like to clutter the 
grammar with those additional calls then a simple solution is to re-factor all terminals into 
separate productions, which consume not only the terminal itself but also all white-space 
around it. 

Another important observation is that we made addition (and also multiplication) right- 
associative. If we were to make it, as usual, left-associative, by replacing the rule for expr 
with: 

expr ::= expr [+] factor / factor 

then we get a grammar that is left-recursive. Left-recursion (also indirect or mutual) is
problematic as it leads to non-terminating parsers. We will come back to this issue in
Section 4.
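The problem with the left-recursive rule can be observed directly in a hand-written recursive descent function. The Python sketch below is an illustration only; `parse_factor` is a hypothetical stand-in that accepts a single digit, and the function names are assumptions of this example. In Python the infinite descent surfaces as a `RecursionError` rather than a true infinite loop.

```python
def parse_factor(s: str, i: int):
    """Stand-in for the factor production: accepts a single digit."""
    return i + 1 if "0" <= s[i:i + 1] <= "9" else None

def parse_expr_left_recursive(s: str, i: int):
    """expr ::= expr [+] factor / factor, transcribed naively."""
    # The first alternative starts by parsing expr at the same position i,
    # so the recursive call is made before any input has been consumed.
    j = parse_expr_left_recursive(s, i)
    if j is not None and s[j:j + 1] == "+":
        k = parse_factor(s, j + 1)
        if k is not None:
            return k
    return parse_factor(s, i)

def diverges(s: str) -> bool:
    """True if parsing s exhausts the call stack."""
    try:
        parse_expr_left_recursive(s, 0)
    except RecursionError:
        return True
    return False
```

The recursion does not depend on the input at all: every call immediately repeats itself with the same position, regardless of what the string contains.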
PEGs can also easily deal with some common idioms often encountered in practical
grammars of programming languages, which pose a lot of difficulty for CFGs, such as a
modular way of handling reserved words of a language and the "dangling" else problem.
We present them in two examples and refer for more details to Ford [For02a, Chapter 2.4].

Example 2.4 (Reserved words). One of the difficulties in tokenization is that virtually
every programming language has a list of reserved words, which should not be accepted as
identifiers. PEGs allow an elegant pattern to deal with this problem:

identifier ::= !reserved letter+ ws
reserved   ::= IF / ...
IF         ::= ["if"] !letter ws

The rule identifier for identifiers reads a non-empty list of letters but only after checking,
with the not-predicate, that there is no reserved word at this position. The rules for the
reserved words ensure that the word is not followed by a letter ("ifs" is a valid identifier) and
consume all the following white space. In this example we only presented a single reserved
word "if" but adding a new word requires only adding a rule similar to IF and extending
the choice in reserved.

Example 2.5 ("Dangling" else). Consider the following part of a CFG for the C language:

stmt ::= IF ( expr ) stmt
       | IF ( expr ) stmt ELSE stmt

According to this grammar there are two possible readings of a statement

if ($e_1$) if ($e_2$) $s_1$ else $s_2$

as the "else $s_2$" branch can be associated either with the outer or the inner if. The desired
way to resolve this ambiguity is usually to bind this else to the innermost construct. This
is exactly the behavior that we get by converting this CFG to a PEG by replacing the
symmetrical choice operator "|" of CFGs with the prioritized choice of PEGs "/" and listing
the alternative with ELSE first, so that it is tried before the shorter one.



3. Extending PEGs with Semantic Actions

3.1. XPEGs: Extended PEGs. In the previous section we introduced parsing expressions,
which can be used to specify which strings belong to the grammar under consideration.
However, the role of a parser is not merely to recognize whether an input is correct or not
but also, given a correct input, to compute its representation in some structured form.
This is typically done by extending grammar expressions with semantic values, which are a
representation of the result of parsing this expression on (some) input, and by extending a
grammar with semantic actions, which are functions used to produce and manipulate the
semantic values. Typically a semantic value associated with an expression will be its parse
tree so that parsing a correct input will give a parse tree of this input. For programming
languages such a parse tree would represent the AST of the language.

In order to deal with this extension we will replace the simple type of parsing expressions
$\Delta$ with a family of types $\Delta_\alpha$, where the index $\alpha$ is a type of the semantic value associated
with the expression. We also compositionally define default semantic values for all types
of expressions and introduce a new construct: coercion, $e[\mapsto]f$, which converts a semantic
value $v$ associated with $e$ to $f(v)$.

Figure 3: Typing rules for parsing expressions with semantic actions

Borrowing notations from Coq we will use the following types: 

• Type is the universe of types. 

• True is the singleton type with a single value I.

• char is the type of machine characters. It corresponds to the type of terminals $V_T$, which
in concrete parsers will always be instantiated to char.

• list $\alpha$ is the type of lists of elements of $\alpha$ for any type $\alpha$. Also string ::= list char.

• $\alpha_1 * \ldots * \alpha_n$ is the type of n-tuples of elements $(a_1, \ldots, a_n)$ with $a_1 \in \alpha_1, \ldots, a_n \in \alpha_n$ for
any types $\alpha_1, \ldots, \alpha_n$. If $v$ is an n-tuple then $v_i$ is its i-th projection.

• option $\alpha$ is the type optionally holding a value of type $\alpha$, with two constructors None and
Some $v$ with $v : \alpha$.

Definition 3.1 (Parsing expressions with semantic values). We introduce a set of parsing
expressions with semantic values, $\Delta_\alpha$, as an inductive family indexed by the type $\alpha$ of
semantic values of an expression. The typing rules for $\Delta_\alpha$ are given in Figure 3.

Note that for the choice operator $e_1 / e_2$ the types of semantic values of $e_1$ and $e_2$ must
match, which will sometimes require use of the coercion operator $e[\mapsto]f$.

Let us again see the derived operators and their types, as we need to insert a few
coercions:

The definition of an extended parsing expression grammar (XPEG) is as expected
(compare with Definition 2.2).



Definition 3.2 (Extended Parsing Expressions Grammar (XPEG)). An extended parsing
expressions grammar (XPEG), $\mathcal{G}$, is a tuple $(V_T, V_N, P_{type}, P_{exp}, v_{start})$, where:

• $V_T$ is a finite set of terminals,

• $V_N$ is a finite set of non-terminals,

• $P_{type} : V_N \to$ Type is a function that gives types of semantic values of all productions,

• $P_{exp}$ is the interpretation of the productions of the grammar, i.e., $P_{exp} : \forall A : V_N,\ \Delta_{P_{type}(A)}$,
and

• $v_{start}$ is the start production, $v_{start} \in V_N$.

Figure 4: Formal semantics of XPEGs with semantic actions.

We extended the semantics of PEGs from Figure 2 to semantics of XPEGs in Figure 4.

Example 3.3 (Simple mathematical expressions ctd.). Let us extend the grammar from
Example 2.3 with semantic actions. The grammar expressed mathematical expressions and
we attach semantic actions evaluating those expressions, hence obtaining a very simple
calculator.

It often happens that we want to ignore the semantic value attached to an expression.
This can be accomplished by coercing this value to I, which we will abbreviate by
$e[\#] ::= e[\mapsto]\lambda x . I$.

ws     ::= ([ ] / [\t])* [#]
number ::= [0-9]+ [↦] digListToNat
term   ::= ws number ws [↦] λx . x_2
         / ws [(] expr [)] ws [↦] λx . x_3
factor ::= term [*] factor [↦] λx . x_1 * x_3
         / term
expr   ::= factor [+] expr [↦] λx . x_1 + x_3
         / factor

where digListToNat converts a list of digits to their decimal representation and $x_i$ in the
productions is the i-th projection of the vector of values $x$, resulting from parsing a sequence.

This grammar will associate, as expected, the semantic value 36 with the string
"(1+2) * (3 * 4)". Of course in practice instead of evaluating the expression we would usually
write semantic actions to build a parse tree of the expression for later processing.
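The calculator grammar can be mimicked with a handful of parser combinators. The Python sketch below is an illustration only and is unrelated to the Coq development: all combinator names (`char`, `lit`, `seq`, `choice`, `star`, `action`) are assumptions of this example, sequences are n-ary rather than nested pairs, and projections are 0-based, so the paper's $x_2$ becomes `x[1]`.

```python
from typing import Any, Callable, Optional, Tuple

Result = Optional[Tuple[Any, str]]      # (semantic value, remaining input) or None
Parser = Callable[[str], Result]

def char(pred: Callable[[str], bool]) -> Parser:
    """Consume one terminal satisfying pred; its value is the terminal itself."""
    return lambda s: (s[0], s[1:]) if s and pred(s[0]) else None

def lit(c: str) -> Parser:
    return char(lambda x: x == c)

def seq(*ps: Parser) -> Parser:
    """Sequence; the semantic value is the tuple of the component values."""
    def run(s: str) -> Result:
        vals = []
        for p in ps:
            r = p(s)
            if r is None:
                return None
            v, s = r
            vals.append(v)
        return tuple(vals), s
    return run

def choice(*ps: Parser) -> Parser:
    """Prioritized choice: the first alternative that succeeds wins."""
    def run(s: str) -> Result:
        for p in ps:
            r = p(s)
            if r is not None:
                return r
        return None
    return run

def star(p: Parser) -> Parser:
    """Greedy repetition; assumes p consumes input whenever it succeeds."""
    def run(s: str) -> Result:
        vals = []
        r = p(s)
        while r is not None:
            v, s = r
            vals.append(v)
            r = p(s)
        return vals, s
    return run

def action(p: Parser, f: Callable[[Any], Any]) -> Parser:
    """Coercion: apply f to the semantic value produced by p."""
    def run(s: str) -> Result:
        r = p(s)
        return None if r is None else (f(r[0]), r[1])
    return run

def dig_list_to_nat(digits) -> int:
    return int("".join(digits))

digit = char(lambda c: "0" <= c <= "9")
ws = action(star(choice(lit(" "), lit("\t"))), lambda _: None)
number = action(seq(digit, star(digit)), lambda x: dig_list_to_nat([x[0]] + x[1]))

# Mutually recursive productions are written as functions so that references
# to each other are resolved only when a production is actually called.
def term(s: str) -> Result:
    return choice(action(seq(ws, number, ws), lambda x: x[1]),
                  action(seq(ws, lit("("), expr, lit(")"), ws), lambda x: x[2]))(s)

def factor(s: str) -> Result:
    return choice(action(seq(term, lit("*"), factor), lambda x: x[0] * x[2]), term)(s)

def expr(s: str) -> Result:
    return choice(action(seq(factor, lit("+"), expr), lambda x: x[0] + x[2]), factor)(s)
```

Calling `expr` on an input returns a pair of the computed value and the unconsumed rest of the string, or `None` if the input does not conform to the grammar. Rebuilding the combinators on every call is wasteful and is done here only to keep the sketch short.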
3.2. Meta-properties of (X)PEGs. Now we will present some results concerning the
semantics of (X)PEGs. They are all variants of results obtained by Ford [For04], only now we
extend them to XPEGs. First we prove that, as expected, the parsing only consumes a
prefix of a string.

Theorem 3.4. If $(e, s) \leadsto \surd^{v}_{s'}$ then $s'$ is a suffix of $s$.

Proof. Induction on the derivation of $(e, s) \leadsto \surd^{v}_{s'}$ using transitivity of the prefix property
for sequence and repetition cases. □

As mentioned earlier, (X)PEGs are unambiguous:

Theorem 3.5. If $(e, s) \overset{m_1}{\leadsto} r_1$ and $(e, s) \overset{m_2}{\leadsto} r_2$ then $m_1 = m_2$ and $r_1 = r_2$.

Proof. Induction on the derivation $(e, s) \overset{m_1}{\leadsto} r_1$ followed by inversion of $(e, s) \overset{m_2}{\leadsto} r_2$. All cases
immediate from the semantics of XPEGs. □

We wrap up this section with a simple property about the repetition operator, that we
will need later on. It states that the semantics of a repetition expression $e*$ is not defined
if $e$ succeeds without consuming any input.

Lemma 3.6. If $(e, s) \overset{m}{\leadsto} \surd^{v}_{s}$ then $(e*, s) \not\leadsto r$ for all $r$.

Proof. Assume $(e, s) \overset{m}{\leadsto} \surd^{v}_{s}$ and $(e*, s) \overset{n}{\leadsto} \surd^{vs}_{s'}$ for some $n$, $vs$ and $s'$ (we cannot have
$(e*, s) \overset{n}{\leadsto} \bot$ as $e*$ never fails). By the first rule for repetition $(e*, s) \overset{m+n+1}{\leadsto} \surd^{v::vs}_{s'}$, which
contradicts the second assumption by Theorem 3.5. □



4. Well-formedness of PEGs 

We want to guarantee total correctness for generated parsers, meaning they must be correct 
(with respect to PEGs semantics) and terminating. In this section we focus on the latter 
problem. Throughout this section we assume a fixed PEG $\mathcal{G}$.

4.1. Termination problem for XPEGs. Ensuring termination of a PEG parser essentially
comes down to two problems:

• termination of all semantic actions in $\mathcal{G}$ and

• completeness of $\mathcal{G}$ with respect to PEGs semantics.

As for the first problem, it means that all functions $f$ used in coercion operators $e[\mapsto]f$
in $\mathcal{G}$ must be terminating. We are going to express PEGs completely in Coq (more on that
in Section 5) so for our application we get this property for free, as all Coq functions are
total (hence terminating).

Concerning the latter problem, we must ensure that the grammar $\mathcal{G}$ under consideration
is complete, i.e., it either succeeds or fails on all input strings. The only potential source of
incompleteness of $\mathcal{G}$ is (mutual) left-recursion in the grammar.

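We already hinted at this problem in Example 2.3 with the rule:

expr ::= expr [+] factor / factor

Recursive descent parsing of expressions with this rule would start with recursively calling
a function to parse expression on the same input, obviously leading to an infinite loop. But
not only direct left recursion must be avoided. In the following rule:

A ::= B / C !D A

a similar problem occurs provided that B may fail and C and !D may succeed, the former
without consuming any input.

$$\frac{}{\varepsilon \in P_0} \quad \frac{}{[\cdot] \in P_{>0}} \quad \frac{}{[\cdot] \in P_\bot} \quad \frac{a \in V_T}{[a] \in P_{>0}} \quad \frac{a \in V_T}{[a] \in P_\bot} \quad \frac{e \in P_\bot}{e* \in P_0} \quad \frac{e \in P_{>0}}{e* \in P_{>0}}$$

$$\frac{\star \in \{0, {>0}, \bot\} \quad P_{exp}(A) \in P_\star}{A \in P_\star} \quad \frac{e_1 \in P_\bot \lor (e_1 \in P_{\geq 0} \land e_2 \in P_\bot)}{e_1; e_2 \in P_\bot}$$

$$\frac{(e_1 \in P_{>0} \land e_2 \in P_{\geq 0}) \lor (e_1 \in P_{\geq 0} \land e_2 \in P_{>0})}{e_1; e_2 \in P_{>0}} \quad \frac{e_1 \in P_0 \quad e_2 \in P_0}{e_1; e_2 \in P_0}$$

$$\frac{e_1 \in P_0 \lor (e_1 \in P_\bot \land e_2 \in P_0)}{e_1 / e_2 \in P_0} \quad \frac{e_1 \in P_\bot \quad e_2 \in P_\bot}{e_1 / e_2 \in P_\bot}$$

$$\frac{e_1 \in P_{>0} \lor (e_1 \in P_\bot \land e_2 \in P_{>0})}{e_1 / e_2 \in P_{>0}} \quad \frac{e \in P_\bot}{!e \in P_0} \quad \frac{e \in P_{\geq 0}}{!e \in P_\bot}$$

Figure 5: Deriving grammar properties.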

While some techniques to deal with left-recursive PEGs have been developed recently
[WDM08], we choose to simply reject such grammars. In general it is undecidable whether
a PEG grammar is complete, as it is undecidable whether the language generated by $\mathcal{G}$ is
empty [For04].

While in general checking grammar completeness is undecidable, we follow Ford [For04]
to develop a simple syntactical check for well-formedness of a grammar, which implies its
completeness. This check will reject left-recursive grammars even if the part with left-recursion
is unreachable in the grammar, but from a practical point of view this is hardly
a limitation.



4.2. PEG analysis. We define the expression set of $\mathcal{G}$ as:

$E(\mathcal{G}) = \{ e' \mid e' \sqsubseteq e,\ e \in P_{exp}(A),\ A \in V_N \}$

where $\sqsubseteq$ is a (non-strict) sub-expression relation on parsing expressions.
We define three groups of properties over parsing expressions:

• "0": parsing expression can succeed without consuming any input,

• "> 0": parsing expression can succeed after consuming some input and

• "$\bot$": parsing expression can fail.

We will write $e \in P_0$ to indicate that the expression $e$ has property "0" (similarly for
$P_{>0}$ and $P_\bot$). We will also write $e \in P_{\geq 0}$ to denote $e \in P_0 \lor e \in P_{>0}$. We define inference
rules for deriving those properties in Figure 5.

$$\frac{P_{exp}(A) \in WF}{A \in WF} \quad \frac{}{\varepsilon \in WF} \quad \frac{}{[\cdot] \in WF} \quad \frac{}{[a] \in WF} \quad \frac{e \in WF}{!e \in WF}$$

$$\frac{e_1 \in WF \quad e_1 \in P_0 \Rightarrow e_2 \in WF}{e_1; e_2 \in WF} \quad \frac{e_1 \in WF \quad e_2 \in WF}{e_1 / e_2 \in WF} \quad \frac{e \in WF \quad e \notin P_0}{e* \in WF}$$

Figure 6: Deriving the well-formedness property for a PEG.

We start with empty sets of properties and apply those inference rules over $E(\mathcal{G})$ until
reaching a fix-point. The existence of the fix-point is ensured by the fact that we extend
those property sets monotonically and they are bounded by the finite set $E(\mathcal{G})$. We summarize
the semantics of those properties in the following lemma:

Lemma 4.1 ([For04]). For arbitrary $e \in \Delta$ and $s \in \mathcal{S}$:

• if $(e, s) \overset{n}{\leadsto} \surd_{s}$ then $e \in P_0$,

• if $(e, s) \overset{n}{\leadsto} \surd_{s'}$ and $|s'| < |s|$ then $e \in P_{>0}$ and

• if $(e, s) \overset{n}{\leadsto} \bot$ then $e \in P_\bot$.

Proof. Induction over $n$. All cases easy by the induction hypothesis and semantical rules of
XPEGs, except for $e*$ which requires use of Lemma 3.6. □
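The fix-point computation of the three properties can be sketched as follows. This Python code is an illustration only: the tagged-tuple encoding, the names `subexprs` and `analyse`, and the exact form of the rules (which follow our reading of Ford's analysis [For04]) are assumptions of this example, not the Coq implementation. Non-terminals are assumed to be defined in the grammar dictionary.

```python
from typing import Dict, Set

# Expressions: ("empty",) ("any",) ("char", a) ("nt", A)
#              ("seq", e1, e2) ("choice", e1, e2) ("star", e) ("not", e)

def subexprs(e: tuple, acc: Set[tuple]) -> Set[tuple]:
    """Collect e together with all of its sub-expressions."""
    acc.add(e)
    if e[0] in ("seq", "choice"):
        subexprs(e[1], acc)
        subexprs(e[2], acc)
    elif e[0] in ("star", "not"):
        subexprs(e[1], acc)
    return acc

def analyse(grammar: Dict[str, tuple]):
    """Iterate the inference rules to a fix-point; returns (P0, Pgt0, Pfail)."""
    exprs: Set[tuple] = set()
    for body in grammar.values():
        subexprs(body, exprs)
    p0, pg, pf = set(), set(), set()

    def ge0(x: tuple) -> bool:            # "can succeed": x in P0 or x in P>0
        return x in p0 or x in pg

    changed = True
    while changed:
        changed = False
        for e in exprs:
            tag = e[0]
            zero = gt0 = fail = False
            if tag == "empty":
                zero = True
            elif tag in ("any", "char"):
                gt0 = fail = True
            elif tag == "nt":             # inherits properties of its definition
                body = grammar[e[1]]
                zero, gt0, fail = body in p0, body in pg, body in pf
            elif tag == "seq":
                a, b = e[1], e[2]
                zero = a in p0 and b in p0
                gt0 = (a in pg and ge0(b)) or (ge0(a) and b in pg)
                fail = a in pf or (ge0(a) and b in pf)
            elif tag == "choice":
                a, b = e[1], e[2]
                zero = a in p0 or (a in pf and b in p0)
                gt0 = a in pg or (a in pf and b in pg)
                fail = a in pf and b in pf
            elif tag == "star":
                zero, gt0 = e[1] in pf, e[1] in pg
            elif tag == "not":
                zero, fail = e[1] in pf, ge0(e[1])
            for flag, props in ((zero, p0), (gt0, pg), (fail, pf)):
                if flag and e not in props:
                    props.add(e)          # sets only grow, so the loop terminates
                    changed = True
    return p0, pg, pf
```

Because each pass can only add elements to three sets bounded by the finite expression set, the loop stops after at most three times as many productive passes as there are expressions.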

Those properties will be used for establishing well-formedness of a PEG, as we will see
in the following section. It is worth noting here that checking whether $e \in P_0$ also plays a
crucial role in the formal approach to parsing developed by Danielsson [Dan10] (we will say
more about his work in Section 7).

It is also interesting to consider such a simplified analysis in our setting, i.e., only
considering $e \in P_0$ and collapsing derivations of Figure 5 by assuming $e \in P_{>0}$ and $e \in P_\bot$
hold for every expression $e$. At first it seems we would lose some precision by such an
over-approximation as for instance that would lead us to conclude $!\varepsilon \in P_0$, whereas in fact
this expression can never succeed without consuming any input (as, quite simply, it can
never succeed). As we will see soon this would lead us to reject a valid definition:

A ::= !ε; A

However, this definition of A is not very interesting as it always fails. In fact, we conjecture
that the differences occur only in such degenerate cases and that in practice such a
simplified analysis would be as efficient as that of [For04].

4.3. PEG well-formedness. Using the semantics of those properties of parsing expressions
we can perform the completeness analysis of $\mathcal{G}$. We introduce a set of well-formed expressions
WF and again iterate from an empty set by using derivation rules from Figure 6 over $E(\mathcal{G})$
until reaching a fix-point.

We say that $\mathcal{G}$ is well-formed if $E(\mathcal{G}) = WF$. We have the following result:

Theorem 4.2 ([For04]). If $\mathcal{G}$ is well-formed then it is complete.

Proof. We will say that $(e, s)$ is complete iff $\exists n, r.\ (e, s) \overset{n}{\leadsto} r$. So we have to prove that $(e, s)$
is complete for all $e \in E(\mathcal{G})$ and all strings $s$. We proceed by induction over the length of the
string $s$ ($IH_{out}$), followed by induction on the depth of the derivation tree of $e \in WF$ ($IH_{in}$).
So we have to prove correctness of a one step derivation of the well-formedness property
(Figure 6) assuming that all expressions are total on shorter strings. The interesting cases
are:

• For a sequence $e_1; e_2$, if $e_1; e_2 \in WF$ then $e_1 \in WF$, so $(e_1, s)$ is complete by $IH_{in}$. If $e_1$
fails then $e_1; e_2$ fails. Otherwise $(e_1, s) \overset{n}{\leadsto} \surd_{s'}$. If $s = s'$ then $e_1 \in P_0$ (Lemma 4.1) and
hence $e_2 \in WF$ and $(e_2, s')$ is complete by $IH_{in}$. If $s \neq s'$ then $|s'| < |s|$ (Theorem 3.4)
and $(e_2, s')$ is complete by $IH_{out}$. Either way $(e_2, s')$ is complete and we conclude by
semantical rules for sequence.

• For a repetition $e*$, $e \in WF$ gives us completeness of $(e, s)$ by $IH_{in}$. If $e$ fails then we
conclude by the base rule for repetition. Otherwise $(e, s) \overset{n}{\leadsto} \surd_{s'}$ with $|s'| < |s|$ as $e \notin P_0$.
Hence we get completeness of $(e*, s')$ by $IH_{out}$ and we conclude with the inductive rule
for repetition. □
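A well-formedness check along these lines can be sketched in a few lines of Python. The sketch is an illustration only and makes two loudly stated assumptions: expressions use a hypothetical tagged-tuple encoding, and property "0" is computed with the simplified over-approximating analysis discussed in Section 4.2 (every expression is assumed able to fail and to consume input), not with the full three-property analysis. The names `fixpoint` and `well_formed` are ours.

```python
from typing import Callable, Dict, Set

def fixpoint(exprs: Set[tuple], holds: Callable[[tuple, Set[tuple]], bool]) -> Set[tuple]:
    """Least set closed under the one-step rule holds(e, current_set)."""
    result: Set[tuple] = set()
    changed = True
    while changed:
        changed = False
        for e in exprs:
            if e not in result and holds(e, result):
                result.add(e)
                changed = True
    return result

def well_formed(grammar: Dict[str, tuple]) -> bool:
    exprs: Set[tuple] = set()

    def collect(e: tuple) -> None:        # expression set: all sub-expressions
        exprs.add(e)
        for sub in e[1:]:
            if isinstance(sub, tuple):
                collect(sub)

    for body in grammar.values():
        collect(body)

    def nullable_step(e: tuple, p0: Set[tuple]) -> bool:
        tag = e[0]
        if tag in ("empty", "star", "not"):   # over-approximation: body may fail
            return True
        if tag == "nt":
            return grammar[e[1]] in p0
        if tag == "seq":
            return e[1] in p0 and e[2] in p0
        if tag == "choice":
            return e[1] in p0 or e[2] in p0
        return False                          # "any", "char"

    p0 = fixpoint(exprs, nullable_step)

    def wf_step(e: tuple, wf: Set[tuple]) -> bool:
        tag = e[0]
        if tag in ("empty", "any", "char"):
            return True
        if tag == "nt":
            return grammar[e[1]] in wf
        if tag == "seq":                      # e2 matters only if e1 may consume nothing
            return e[1] in wf and (e[1] not in p0 or e[2] in wf)
        if tag == "choice":
            return e[1] in wf and e[2] in wf
        if tag == "star":
            return e[1] in wf and e[1] not in p0
        if tag == "not":
            return e[1] in wf
        return False

    return fixpoint(exprs, wf_step) == exprs
```

With this simplified analysis the degenerate definition A ::= !ε; A is rejected, which matches the loss of precision described in Section 4.2; the right-recursive expression grammar is accepted and its left-recursive variant is rejected.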



5. Formally Verified XPEG interpreter 

In this section we will present a Coq implementation of a parser interpreter. This task
consists of formalizing the theory of the preceding sections and, based on this, writing an
interpreter for well-formed XPEGs along with its correctness proofs. The development is
too big to present in detail here, but we will try to comment on its most interesting
aspects.

We will describe how PEGs are expressed in Coq in Section 5.1, comment on the procedure
for checking their well-formedness in Section 5.2 and describe the formal development
of an XPEG interpreter in Section 5.3.



5.1. Specifying XPEGs in Coq. XPEGs in Coq are a simple reflection of Definition 3.2.
They are specified over a finite enumeration of non-terminals (corresponding to $V_N$) with
their types ($P_{type}$):

Parameter prod : Enumeration.
Parameter prod_type : prod -> Type.

Building on that we define:

• pexp: un-typed parsing expressions, $\Delta$, and

• PExp: their typed variant, $\Delta_\alpha$, which follows the typing discipline from Figure 3.
We present both definitions one after the other:

Inductive pexp : Type :=
| empty
| anyChar
| terminal (a : char)
| range (a z : char)
| nonTerminal (p : prod)
| seq (e1 e2 : pexp)
| choice (e1 e2 : pexp)
| star (e : pexp)
| not (e : pexp)
| id (e : pexp).

Inductive PExp : Type -> Type :=
| Empty : PExp True
| AnyChar : PExp char
| Terminal : char -> PExp char
| Range : char * char -> PExp char
| NonTerminal : forall p, PExp (prod_type p)
| Seq : forall A B, PExp A -> PExp B -> PExp (A * B)
| Choice : forall A, PExp A -> PExp A -> PExp A
| Star : forall A, PExp A -> PExp (list A)
| Not : forall A, PExp A -> PExp True
| Action : forall A B, PExp A -> (A -> B) -> PExp B.



Those definitions are straightforward encodings of Definitions 2.1 and 3.1. We implemented
the range operator $[a-z]$ as a primitive, as in practice it occurs frequently in parsers and
implementing it as a derived operation by a choice over all the characters in the range is
inefficient. That means that in the formalization we had to extend the semantics of Figure 4
with this operator, in a straightforward way.

It is worth noting here that PExp is large, in terms of Coq universe levels, as its index
lives in Type. We never work with propositional equality of types, so the constraints on types
used in constructors of PExp come only from the inductive definition itself. In particular,
PExp must live at a higher universe level than any type used in its constructors.

For "regular" use of our parsing machinery this should pose no problems. However,
should we want to develop some higher-order grammars (grammars that upon parsing return
another grammar) we would very soon run into Coq's Universe Inconsistency problems. In
fact higher-order grammars are not expressible in our framework anyway, due to the use of
Coq's module system. We will return to this issue in Section 8.

With pexp and PExp in place we continue by defining, in an obvious way, conversion
functions from one structure to the other.

Fixpoint pexp_project T (e : PExp T) : pexp := (...)
Fixpoint pexp_promote (e : pexp) : PExp True := (...)

Conversion from PExp to pexp simply erases types and maps Actions to the dummy constructor
id. Conversion in the other direction maps to expressions of a singleton type True, inserting,
where needed, type coercions using the Action operator.
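The erasing direction of the conversion can be illustrated outside Coq. The Python sketch below is only an analogy: Python has no indexed types, so "typed" expressions are modelled as tagged tuples whose tags reuse the PExp constructor names, and an `Action` node carries an arbitrary Python function. The body of the real `pexp_project` is elided in the text above, so this is an assumption about its shape, not a transcription.

```python
def pexp_project(e: tuple) -> tuple:
    """Erase semantic actions: ("Action", e, f) becomes ("id", e')."""
    tag = e[0]
    if tag == "Action":
        # The function e[2] is dropped; only a dummy id node remains.
        return ("id", pexp_project(e[1]))
    if tag in ("Seq", "Choice"):
        return (tag.lower(), pexp_project(e[1]), pexp_project(e[2]))
    if tag in ("Star", "Not"):
        return (tag.lower(), pexp_project(e[1]))
    # Leaves (Empty, AnyChar, Terminal, Range, NonTerminal) keep their
    # arguments; only the constructor name switches to the untyped spelling.
    return (tag[0].lower() + tag[1:],) + tuple(e[1:])
```

Keeping an `id` node in place of each action preserves the shape of the expression, so properties computed on the erased expression can be related back to the typed one.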

To complete the definition of an XPEG grammar (Definition 3.2) we declare definitions of
non-terminals ($P_{exp}$) and the starting production ($v_{start}$) as:

Parameter production : forall p : prod, PExp (prod_type p).
Parameter start : prod.

There are two observations that we would like to make at this point. First, by means 
of the above embedding of XPEGs in Coq, every such XPEG is well-defined (though not
necessarily well-formed). In particular there can be no calls to undefined non-terminals and
the conformance with the typing discipline from Figure 3 is taken care of by the type-checker
of Coq. 

Secondly, thanks to the use of Coq's mechanisms, such as notations and coercions, 
expressing an XPEG in Coq is still relatively easy as we will see in the following example. 

Example 5.1. Figure 7 presents a precise Coq rendering of the productions of the XPEG
grammar from Example 3.3. It is not much more verbose than the original example. Each
Pi function corresponds to the i-th projection, and they work with arbitrary n-tuples thanks to
the type-class mechanism.

5.2. Checking well-formedness of an XPEG. To check well-formedness of XPEGs we
implement the procedure from Section 4. It is worth noting that the function to compute
XPEG properties, by iterating the derivation rules of Figure 5 until reaching a fix-point, is
not structurally recursive. The same holds for the well-formedness check with rules from Figure 6.
Fortunately the Program feature [Soz07] of Coq makes specifying such functions much
easier. We illustrate it on the well-formedness check (computing properties is analogous).
We begin with the one-step well-formedness derivation corresponding to Figure 6.

Program Definition production p :=
  match p return PExp (prod_type p) with
  | ws => (" " / "\t") [*] [#]
  | number => ["0" -- "9"] [+] [->] digListToRat
  | term => ws; number; ws [->] (fun v => P2 v)
          / ws; "("; expr; ")"; ws [->] (fun v => P3 v)
  | factor => term; "*"; factor [->] (fun v => P1 v * P3 v)
          / term
  | expr => factor; "+"; expr [->] (fun v => P1 v + P3 v)
          / factor
  end.

Figure 7: A Coq version of the XPEG for mathematical expressions from Example 3.3.

Definition wf_analyse (exp : pexp) (wf : PES.t) : bool :=
  match exp with
  | empty => true
  | range => true
  | terminal a => true
  | anyChar => true
  | nonTerminal p => is_wf (production p) wf
  | seq e1 e2 => is_wf e1 wf && (if e1 -[gp]-> 0 then is_wf e2 wf else true)
  | choice e1 e2 => is_wf e1 wf && is_wf e2 wf
  | star e => is_wf e wf && (negb (e -[gp]-> 0))
  | not e => is_wf e wf
  | id e => is_wf e wf
  end.

This function takes a set of well-formed expressions computed so far (PES standing for
"parsing expression set") and an expression exp and returns true iff exp should also be
considered well-formed, according to the derivation system of Figure 6. Here gp is the set of
global properties computed following the procedure of Section 4.2 (again, we do not show
the code here, as that procedure is analogous to the inference of well-formedness that
we describe). Hence e -[gp]-> 0 should be read as $e \in P_0$, and is_wf is an abbreviation for
set membership, i.e.:

Definition is_wf : pexp -> PES.t -> bool := PES.mem.

With that in place we continue with a simple function that extends the set of
well-formed expressions with the one being considered now, in case it was established to be
well-formed by invocation of wf_analyse, and otherwise leaves this set unchanged.

Definition wf_analyse_exp (exp : pexp) (wf : PES.t) : PES.t :=
  if wf_analyse exp wf then PES.add exp wf else wf.

Now the one-step derivation over all expressions $E(G)$, represented by the constant
grammarExpSet below, can be realized as a simple fold operation using the above function:

Definition wf_derive (wf : PES.t) : PES.t :=
  PES.fold wf_analyse_exp grammarExpSet wf.

Now, the complete analysis is a fixpoint of applying the one-step derivation wf_derive.

Program Fixpoint wf_compute (wf : WFset) {measure wf_measure wf} : WFset :=
  let wf' := wf_derive wf in
  if PES.equal wf wf' then wf else wf_compute wf'.

Here WFset is a set of well-formed expressions:

Definition WFset := {e : PES.t | wf_prop e}.

where wf_prop is a predicate capturing well-formedness of an expression.

The main difficulty here is that wf_compute is not structurally recursive. However, we
can construct a measure (into $\mathbb{N}$) that will decrease along recursive calls as:

$$\mathit{wf\_measure} := |E(G)| - |\mathit{wf}|$$

Now we can prove this procedure terminating, as the set of well-formed expressions is
growing monotonically and is contained in $E(G)$:

$$\mathit{wf} \subseteq \mathit{wf\_derive}\ \mathit{wf}$$

$$\mathit{wf} \subseteq E(G) \implies \mathit{wf\_derive}\ \mathit{wf} \subseteq E(G)$$

The Program feature [Soz07] of Coq is very helpful in expressing such non-structurally
recursive functions, as well as in general programming with dependent types. The downside of
Program is that it inserts type casts, making reasoning about such functions more difficult.
This can usually be overcome with the use of sigma-types capturing the function
specification (wf_prop in our example) together with its return value. This style of programming
seems to be particularly well suited when working with Program.

Finally we obtain the set of well-formed expressions of a grammar by iterating to a
fix-point, starting with an empty set:

Program Definition WFexps : PES.t := wf_compute PES.empty.

A grammar expression exp is well-formed if it belongs to this set:

Definition WF (exp : pexp) : Prop := PES.In exp WFexps.

and a grammar is well-formed if all its expressions are well-formed:

Definition grammar_WF : Prop := grammarExpSet [=] WFexps.
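The fixpoint iteration above can also be sketched outside Coq. The following Python fragment is a minimal illustration under stated assumptions, not the Coq development: the expression classes, the dictionary encoding of productions and the parameter nullable (standing in for the set $P_0$ of expressions that may succeed without consuming input, which the text computes separately) are hypothetical names, and only a subset of the operators is covered. Unlike wf_derive, each round analyses every expression against the set from the previous round; since the one-step rule is monotone, both variants should reach the same fixpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Empty:
    pass


@dataclass(frozen=True)
class Terminal:
    char: str


@dataclass(frozen=True)
class NonTerminal:
    name: str


@dataclass(frozen=True)
class Seq:
    left: object
    right: object


@dataclass(frozen=True)
class Choice:
    left: object
    right: object


@dataclass(frozen=True)
class Star:
    body: object


@dataclass(frozen=True)
class Not:
    body: object


def subexpressions(grammar):
    """E(G): all expressions of the productions, closed under subterms."""
    seen, stack = set(), list(grammar.values())
    while stack:
        e = stack.pop()
        if e in seen:
            continue
        seen.add(e)
        if isinstance(e, (Seq, Choice)):
            stack += [e.left, e.right]
        elif isinstance(e, (Star, Not)):
            stack.append(e.body)
    return seen


def wf_analyse(e, wf, grammar, nullable):
    """One-step rule: is e well-formed given the set wf found so far?"""
    if isinstance(e, (Empty, Terminal)):
        return True
    if isinstance(e, NonTerminal):
        return grammar[e.name] in wf
    if isinstance(e, Seq):
        return e.left in wf and (e.right in wf if e.left in nullable else True)
    if isinstance(e, Choice):
        return e.left in wf and e.right in wf
    if isinstance(e, Star):
        return e.body in wf and e.body not in nullable
    if isinstance(e, Not):
        return e.body in wf
    raise TypeError(e)


def wf_compute(grammar, nullable):
    """Iterate the one-step derivation until the set stops growing."""
    all_exps, wf = subexpressions(grammar), set()
    while True:
        # Each round can only add expressions, and wf stays inside E(G),
        # so |E(G)| - |wf| strictly decreases until the fixpoint is reached.
        new_wf = wf | {e for e in all_exps if wf_analyse(e, wf, grammar, nullable)}
        if new_wf == wf:
            return wf
        wf = new_wf


def grammar_wf(grammar, nullable):
    """A grammar is well-formed if every expression in E(G) is."""
    return wf_compute(grammar, nullable) == subexpressions(grammar)
```

On a right-recursive production such as A <- "a" A / empty the check succeeds, while the left-recursive L <- L "a" / "a" and a star over the empty expression are both rejected.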

Above we presented the complete code of the well-formedness analysis (Section 4.3),
excluding the inference of properties (Section 4.2). Naturally, each of those functions is
accompanied by lemmas stating its correctness, together with their proofs. Those proofs, with
the Ltac definitions used to discharge them, constitute roughly 4-5x the size of the definitions.
This factor is so low thanks to heavy use of Ltac automation in the proofs, the proof style
advocated by Chlipala [Chl09], which we eventually learned to embrace fully.

Our interpreter (more on it in the following section) will work on XPEGs, not on PEGs.
However, the termination analysis sketched above considers un-typed parsing expressions
pexp, obtained by projecting XPEG expressions (with pexp_project). The reason is twofold.

Firstly, semantic actions are embedded in Coq's programming language and hence are
terminating and have no influence on the termination analysis of the grammar. Hence
termination of the parser on expression e : PExp T is immediate from termination of
pexp_project e : pexp.

Secondly, the well-formedness procedure presented above needs to maintain a set of
parsing expressions (WFset) and for that we need a decidable equality over parsing
expressions. Equality over PExp is not decidable as, within the coercion operator e[↦]f, they contain
arbitrary functions f.

An alternative approach would be to consider WFset modulo an equivalence relation on
parsing expressions coarser than the syntactic equality, which would ignore the f components in
e[↦]f coercions. That would avoid formalization of the un-typed structure pexp altogether
for the price of reasoning with dependently typed PExps in the well-formedness analysis.

5.3. A formal interpreter for XPEGs. For the development of a formal interpreter for
XPEGs we used the ascii type of Coq for the set of terminals V_T. The string type from the
standard library of Coq is isomorphic to lists of characters. In its place we just used a list
of characters, in order to be able to re-use a rich set of available functions over lists.
First let us define the result of parsing an expression PExp T on some string:

Inductive ParsingResult (T : Type) : Type :=
| PR_fail
| PR_ok (s : string) (v : T).

i.e., a parsing can either fail (PR_fail) or succeed (PR_ok s v), in which case we obtain a
suffix s that remains to be parsed and an associated semantic value v.

Now, after requiring a well-formed grammar, the interpreter can be defined as a function
with the following header:

Variable GWF : grammar_WF.

Program Fixpoint parse (T : Type) (e : PExp T | is_grammar_exp e) (s : string)
  {measure (e, s) (≻)} : {r : ParsingResult T | exists n, [e, s] => [n, r]}

So this function takes three arguments (the first one implicit):

• T: a type of the result of parsing (α),

• e: a parsing expression of type T (Δ_α), with a proof (is_grammar_exp e) that it belongs
to the grammar G (which in turn is checked beforehand to be well-formed) and

• s: a string to be parsed.

The last line in the above header describes the type of the result of this function, where
[e, s] => [n, r] is the expected encoding of the semantics of XPEGs and corresponds
to $(e, s) \leadsto^{n} r$. So the parse function produces the parsing result r (either a failure $\bot$ or a
success carrying a value v : T), such that $(e, s) \leadsto^{n} r$ for some n, i.e., it is correct with
respect to the semantics of XPEGs.

The body of the parse function performs pattern matching on expression e and
interprets it according to the semantics from Figure 2. We show a simplified (the actual pattern
matching is slightly more involved due to dealing with dependent types) excerpt of this
function for a few types of expressions:

match e with
| Empty => Ok s I
| Terminal c =>
    match s with
    | nil => Fail
    | x :: xs =>
        match CharAscii.eq_dec c x with
        | left _ => Ok xs c
        | right _ => Fail
        end
    end
| NonTerminal p => parse (production p) s
| Choice _ e1 e2 =>
    match parse e1 s with
    | PR_ok s' v => Ok s' v
    | PR_fail => parse e2 s
    end
| Star _ e =>
    match parse e s with
    | PR_fail => Ok s []
    | PR_ok s' v =>
        match parse (e [*]) s' with
        | PR_fail => !
        | PR_ok s'' v' => Ok s'' (v :: v')
        end
    end
| Not _ e =>
    match parse e s with
    | PR_ok _ _ => Fail
    | PR_fail => Ok s I
    end
| Action _ _ e f =>
    match parse e s with
    | PR_ok s' v => Ok s' (f v)
    | PR_fail => Fail
    end
end

The termination argument for this function is based on the decrease of the pair of arguments
(e, s) in recursive calls with respect to the following relation $\succ$:

$$(e_1, s_1) \succ (e_2, s_2) \iff \exists\, n_1, r_1, n_2, r_2.\ (e_1, s_1) \leadsto^{n_1} r_1 \,\land\, (e_2, s_2) \leadsto^{n_2} r_2 \,\land\, n_1 > n_2$$

So $(e_1, s_1)$ is bigger than $(e_2, s_2)$ in the order if its step-count in the semantics is bigger.
The relation $\succ$ is clearly well-founded, due to the last conjunct with >, the well-founded
order on $\mathbb{N}$. Since the semantics of G is complete (due to Theorem 4.2 and the check for
well-formedness of G as described in Section 5.2) we can prove that all recursive calls are
indeed decreasing with respect to $\succ$.
Clearly this function also generates a number of proof obligations for expressing
correctness of the returned result with respect to the semantics of PEGs. Dismissing them is
actually rather straightforward, due to the fact that the implementation of the interpreter
and the operational semantics of PEGs are very close to each other. That means that by far
the majority of our work was in establishing termination, not correctness.
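To make the shape of the interpreter and of its step count concrete, here is a small Python sketch. It is an illustration only and not the extracted parser: expressions are encoded as tagged tuples, the grammar is a dictionary, and charging one step per rule application is an assumption made for this example (the paper's semantics defines its own step count). On a well-formed grammar the step count of every recursive call is strictly smaller than that of the caller, which is the intuition behind the termination order.

```python
FAIL = None  # plays the role of PR_fail


def parse(grammar, e, s):
    """Return (steps, result); result is FAIL or a pair (rest, value)."""
    tag = e[0]
    if tag == "empty":
        return 1, (s, ())
    if tag == "any":
        return 1, ((s[1:], s[0]) if s else FAIL)
    if tag == "char":
        return 1, ((s[1:], s[0]) if s and s[0] == e[1] else FAIL)
    if tag == "range":
        return 1, ((s[1:], s[0]) if s and e[1] <= s[0] <= e[2] else FAIL)
    if tag == "nt":
        n, r = parse(grammar, grammar[e[1]], s)
        return n + 1, r
    if tag == "seq":
        n1, r1 = parse(grammar, e[1], s)
        if r1 is FAIL:
            return n1 + 1, FAIL
        n2, r2 = parse(grammar, e[2], r1[0])
        if r2 is FAIL:
            return n1 + n2 + 1, FAIL
        return n1 + n2 + 1, (r2[0], (r1[1], r2[1]))
    if tag == "choice":
        n1, r1 = parse(grammar, e[1], s)
        if r1 is not FAIL:
            return n1 + 1, r1
        n2, r2 = parse(grammar, e[2], s)
        return n1 + n2 + 1, r2
    if tag == "star":
        n1, r1 = parse(grammar, e[1], s)
        if r1 is FAIL:
            return n1 + 1, (s, [])
        # A star never fails, so r2 is always a success here.
        n2, r2 = parse(grammar, e, r1[0])
        return n1 + n2 + 1, (r2[0], [r1[1]] + r2[1])
    if tag == "not":
        n, r = parse(grammar, e[1], s)
        return n + 1, ((s, ()) if r is FAIL else FAIL)
    if tag == "action":
        n, r = parse(grammar, e[1], s)
        return n + 1, (FAIL if r is FAIL else (r[0], e[2](r[1])))
    raise ValueError(tag)


# Hypothetical example grammar: expr <- number "+" expr / number
DIGIT = ("range", "0", "9")
NUMBER = ("action", ("seq", DIGIT, ("star", DIGIT)),
          lambda v: int(v[0] + "".join(v[1])))
SUMS = {
    "expr": ("choice",
             ("action", ("seq", NUMBER, ("seq", ("char", "+"), ("nt", "expr"))),
              lambda v: v[0] + v[1][1]),
             NUMBER),
}

steps, result = parse(SUMS, ("nt", "expr"), "12+30")
print(steps, result)
```

The sketch performs no well-formedness check, so on a left-recursive grammar it would simply recurse until Python's recursion limit is hit; the certified interpreter rules this case out beforehand.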



6. Extracting a Parser: Practical Evaluation 

In the previous section we described a formal development of an XPEG interpreter in the
proof assistant Coq. This should allow us, for an arbitrary well-formed XPEG G, to specify
it in Coq and, using Coq's extraction capabilities [Let08], to obtain a certified parser for
G. We are interested in code extraction from Coq, to ease practical use of TRX and to
improve its performance. At the moment target languages for extraction from Coq are
OCaml [L+96], Haskell [PJ+02] and Scheme [SJ98]. We use the FSets [FL04] library (part
of the Coq standard library for manipulation of the set data-type) developed using Coq's
modules and functors [Chr03], which are not yet supported by extraction to Haskell or
Scheme. However, there is ongoing work on porting FSets to type classes [SO08], which
are supported by extraction.

First, in Section 6.1 we will sketch the various performance-related improvements that
we made along our development and present case studies on two examples: XML and Java.
Then in Section 6.2 we will present a benchmark of certified TRX against a number of other
tools on those two examples.



6.1. Case study of TRX on XML and Java. A well-known issue with extraction is
the performance of obtained programs [CFL06, Let08]. Often the root of this problem is
the fact that many formalizations are not developed with extraction in mind and trying to
extract a computational part of the proof can easily lead to disastrous performance [CFL06].
On the other hand the CompCert project [Ler09] is a well-known example of extracting a
certified compiler with satisfactory performance from a Coq formalization.

As most of TRX's formalization deals with grammar well-formedness, which should be 
discarded in the extracted code, we aimed at comparable performance for certified TRX 
and its non-certified counterpart that we prototyped manually. We found however that the 
first version's performance was unacceptable and required several improvements, which we 
will discuss in the remainder of this section. 

We started with a case study of XML using an XML PEG developed internally at 
MLstate. The first extracted version of TRX-cert parsed 32kB of XML in more than one 
minute. To our big surprise, performance was somewhere between quadratic and cubic with 
rather large constants. To our even bigger surprise, inspection of the code revealed that the 
rev function from Coq's standard library (from the module Coq.Lists.List) that reverses
a list was the source of the problem. The rev function is implemented using append to 
concatenate lists at every step, hence yielding quadratic time complexity. 

We used this function to convert the input from OCaml strings to the extracted type 
of Coq strings. This is another difficulty of working with extracted programs: all the
data-types in the extracted program are defined from scratch and combining such programs with
un-certified code, even just to add a minimal front-end, as in our case, sometimes requires 
translating back and forth between OCaml's primitive types and the extracted types of 
Coq. 
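The quadratic behaviour of an append-based reversal can be seen by counting list-cell operations. The Python sketch below is only a cost model, not the Coq code: Python lists are arrays, so the counter cost is a hypothetical device that charges what the corresponding cons-list operations would cost. The accumulator-based variant is shown as one standard remedy; the text does not say which fix the authors actually applied.

```python
def append(xs, ys, cost):
    """xs ++ ys on cons lists rebuilds every cell of xs: len(xs) steps."""
    cost[0] += len(xs)
    return xs + ys


def rev_naive(xs, cost):
    """rev (x :: l) = rev l ++ [x], i.e. one append per element."""
    if not xs:
        return []
    return append(rev_naive(xs[1:], cost), [xs[0]], cost)


def rev_acc(xs, cost):
    """Accumulator-based reversal: a single cons per element."""
    acc = []
    for x in xs:
        cost[0] += 1       # models one cons (x :: acc)
        acc.insert(0, x)   # Python detail only; the model charges one step
    return acc


for n in (100, 200):
    naive_cost, acc_cost = [0], [0]
    assert rev_naive(list(range(n)), naive_cost) == rev_acc(list(range(n)), acc_cost)
    print(n, naive_cost[0], acc_cost[0])
```

Doubling the input length roughly quadruples the modelled cost of the naive version (n(n-1)/2 steps), while the accumulator version stays at n steps.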
Fixing the problem with rev resulted in a linear complexity but the constant was still
unsatisfactory. We quickly realized that implementing the range operator by means of
repeated choice is suboptimal as a common class of letters [a-z] would lead to a composition
of 26 choices. Hence we extended the semantics of XPEGs with semantics of the range
operator and instead of deriving it implemented it "natively".
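The difference between the derived and the primitive range operator can be illustrated by counting character comparisons. This is a minimal Python sketch with hypothetical function names; it models only the matching of a single character and is not taken from TRX.

```python
def match_choice_chain(c, lo, hi):
    """Derived form: 'a' / 'b' / ... / 'z', alternatives tried left to right."""
    steps = 0
    for code in range(ord(lo), ord(hi) + 1):
        steps += 1
        if c == chr(code):
            return True, steps
    return False, steps


def match_range(c, lo, hi):
    """Primitive form: at most two comparisons, whatever the range size."""
    return lo <= c <= hi, 2


print(match_choice_chain("z", "a", "z"))  # worst case for a matching letter
print(match_range("z", "a", "z"))
```

A character outside the range is the worst case for the derived form, since every alternative has to fail before the whole choice fails.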

Yet another surprise was in store for us as the performance instead of improving got 
worse by approximately 30%. This time the problem was the fact that in Coq there is no 
predefined polymorphic comparison operator (as in OCaml) so for the range operation we 
had to implement comparison on characters. We did that by using the predefined function 
from the standard library converting a character to its ASCII code. And yet again we 
encountered a problem that the standard library is much better suited for reasoning than 
computing: this conversion function uses natural numbers in Peano representation. By 
re-implementing this function using natural numbers in binary notation (available in the 
standard library) we decreased the running time by a factor of 2. 

Further profiling the OCaml program revealed that it spends 85% of its time
performing garbage collection (GC). By tweaking the parameters of OCaml's GC, we obtained an
important 3x gain, leading to TRX-cert's current performance as presented in the following
section. We believe a more careful inspection will reveal more potential sources of
improvements, as there is still a gap between the performance that we reached now and the one of
our prototype written by hand.

We continued with a more realistic case study based on parsing the Java language, 
using the PEG for Java developed by Redziejowski [Red07]. The grammar, consisting of
216 rules, was automatically translated to TRX format. We immediately hit performance 
problems as our encoding contains a type enumerating all the rules (prod) and proving that 
equality is decidable on this type, using Coq's decide equality tactic, took initially 927 sec. 
(~ 15 minutes). We were able to improve it by writing a tactic dedicated to such simple 
enumeration types (using Coq's Ltac language) and decrease this time to 104 sec. 

We did not meet any more scaling difficulties. Testing XML and Java grammars for 
well-formedness, with the extracted OCaml code, took, respectively, 0.1 and 0.7 sec. (this
test needs to be performed only once). We will discuss the performance of the parsing itself, 
and compare it with other tools, in the following section. 

6.2. Performance comparison. For our benchmarking experiment (see Figure 8) we used
the following tools:

JAXP: a reference implementation for the XML parser, using a DOM parser of the "Java
API for XML processing", JAXP [JAX].

JavaCC: a Java parser [Java] written in Java using the JavaCC [Javb] parser generator.

TRX-cert: the certified TRX interpreter, which is the subject of this paper and is described
in more detail in Section 5.

TRX-gen: MLstate's own PEG-based parser generator used in production (for experiments
we used its simple version without memoization).

TRX-int: a simple prototype with comparable functionality to TRX-cert, though
developed manually.

Mouse: a PEG-based parser generator, with no memoization, implemented in Java by
Redziejowski [Red09].

Figure 8 plots performance of the aforementioned tools on two benchmarks:
XML: 10 XML files with a total size of 40MB generated using the XML benchmarks
generator XMark [SWK+02].

Java: a complete source code of the J2SE JDK 5.0 consisting of nearly 11,000 files with a
total size of 117MB.

Figure 8: Performance of certified TRX (TRX-cert) compared to a number of other tools
on the examples of parsing Java and XML.

The most interesting comparison is between TRX-cert and TRX-int. The latter was
essentially a prototype of the former but developed manually, whereas TRX-cert is extracted
from a formal Coq development. At the moment the certified version is approximately
2-3x slower. In principle this difference can be attributed either to the verification
overhead (computations that are performed but should not be, as they are part of the logical
reasoning to prove correctness and not of the actual algorithm), extraction overhead
(sub-optimal code generated by the extraction process) or algorithmic overhead (the algorithm
that we coded in Coq is sub-optimal in itself).

We believe there is no verification overhead in TRX-cert, as all the correctness proofs 
are discarded by the process of extraction and we never used the proof mode of Coq to 
define objects with computational content (which are extracted). 

The extraction overhead in our case mainly manifests itself in many dispensable
conversions. For instance the second component of the sigma type {x : T | P (x)} is discarded
during the extraction, so such a type is extracted simply as T and the first projection
function proj1_sig as identity. Since sigma types are used extensively in our verification,
the extracted code is full of such vacuous conversions. However, our experiments seem to
indicate that OCaml's compiler is capable of optimizing such code, so that this should have
no noticeable impact on performance.

Apart from those two types of overheads associated with extraction, often the
sub-optimal extracted code can be traced back to sub-optimal code in the development itself
or in Coq libraries. We already mentioned a few such problems in Section 6.1. We believe
another one is the model of characters from the standard library of Coq, Coq.Strings.Ascii,
which we used in this work. The characters are modeled by 8 booleans, i.e., 8 bits of the
character:

Inductive ascii : Set := Ascii (_ _ _ _ _ _ _ _ : bool).

Not surprisingly such characters induce a larger memory footprint and also comparison
between such structures is much less efficient than between native (1-byte) characters of OCaml.
There is ongoing work on improving the interplay between OCaml's native types and their
Coq counterparts, which should hopefully address this problem.
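The cost of this representation is easier to appreciate with a small model. The Python sketch below mimics a character as a tuple of 8 booleans (least significant bit first, as in the Ascii constructor) and compares characters through a binary code, one step per bit. The function names are hypothetical and the sketch says nothing about how the extracted OCaml code is actually laid out in memory.

```python
def ascii_of_char(c):
    """Model of an Ascii value: 8 booleans, least significant bit first."""
    code = ord(c)
    assert code < 256, "only 8-bit characters are modelled"
    return tuple(bool((code >> i) & 1) for i in range(8))


def code_of_ascii(bits):
    """Binary conversion: one doubling-and-add step per bit."""
    n = 0
    for b in reversed(bits):
        n = 2 * n + (1 if b else 0)
    return n


def in_range(c, lo, hi):
    """Range test on modelled characters, via their binary codes."""
    return code_of_ascii(lo) <= code_of_ascii(c) <= code_of_ascii(hi)


a, m, z = (ascii_of_char(ch) for ch in "amz")
print(a, code_of_ascii(a), in_range(m, a, z))
```

With a unary (Peano) representation the code of "z" would instead be a chain of 122 successor constructors, which is the kind of cost the switch to binary numbers in Section 6.1 avoids.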
However, the main opportunity for improving performance seems to be in switching
from interpretation to code generation. As witnessed by the difference between TRX-int
and TRX-gen this can have a very substantial impact on performance. We will say some
more about that in the discussion in Section 8.

It is worth noting that the performance of TRX-cert is quite competitive when compared 
with Java code generated by Mouse. 

We would like to conclude this section with the observation that even though making 
such benchmarks is important it is often just one of many factors for choosing a proper 
tool for a given task. There are many applications which will never parse files exceeding 
100kB and it is often irrelevant whether that will take 0.1s or 0.01s. For some of those
applications it may be much more relevant that the parsing is formally guaranteed to be 
correct. 

7. Related Work 

Parsing is a well-studied and well-understood topic and the software for parsing, parser 
generators or libraries of parser combinators, is abundant. And yet there does seem to be 
hardly any work on formally verified parsing. 

Danielsson [Dan10] develops a library of parser combinators (see Hutton [Hut92])
with termination guarantees in the dependently typed functional programming language
Agda [Agd] (see also joint work with Norell [DN08]). The main difference in comparison
with our work is that Danielsson provides a library of combinators, whereas we aim at a
parser generator for PEG grammars (though at the moment we only have an interpreter).
Perhaps more importantly, the approach of Danielsson allows many forms of left recursion,
which we cannot handle at present. Another difference is in the way termination is ensured:
Danielsson uses dependent types to extend the type of parser combinators with the information
about whether or not they accept the empty string, which is subsequently used to guarantee
termination. In contrast we use deep embedding of the grammar and a reflective procedure
to check whether a given grammar is terminating. Some consequences of those choices will
be explored in more depth in the following section.

Ideas similar to Danielsson and Norell [DN08] were previously put forward, though just
as a proof of concept, by McBride and McKinna [MM02].

Probably the closest work to ours is that of Barthwal and Norrish [BN09], where the
authors developed an SLR parser in HOL. The main differences with our work are:

• PEGs are more expressive than SLR grammars, which are usually not adequate for
real-world computer languages,

• as a consequence of using PEGs we can deal with lexical analysis, while it would have to
be formalized and verified in a separate stage for the SLR approach,

• our parser is proven to be totally correct, i.e., correct with respect to its specification
and terminating on all possible inputs (which was actually far more difficult to establish
than correctness), while the latter property does not hold for the work of Barthwal and
Norrish,

• performance comparison with this work is not possible as the paper does not present any
case-studies, benchmarks or examples, but the fact that "the DFA states are computed
on the fly" [BN09] suggests that the performance was not the utmost goal of that work.
Finally there is the recent development of a packrat PEG parser in Coq by Wisnesky et
al. [WMM09], where the given PEG grammar is compiled into an imperative computation
within the Ynot framework that, when run over an arbitrary imperative character stream,
returns a parsing result conforming with the specification of PEGs. Termination of such
generated parsers is not guaranteed.

8. Discussion and Future Work 

One of the main challenges in developing a certified parser is ensuring its termination. In 
this paper we presented an extrinsic approach to this problem: we use a deep embedding to 
represent parsing expressions in Coq and then develop a certified algorithm to verify that 
a given PEG is well-formed. We then express the parser (interpreter) with non-structural 
recursion and the well-formedness of the grammar allows us to justify that the recursion is 
well-founded. 

There is an alternative, intrinsic approach to the problem of termination, which is, for 
instance, used by Danielsson [DN08, Dan10], as mentioned in the previous section. They
develop a library of parser combinators and use the type system of the host language - in 
this case, Agda - to restrict the parser combinators to well-formed ones. 

This is a very attractive approach, as by cleverly using the type system of the host 
language we obtain certain verified properties for free, hence decreasing the formalization 
overhead. However, it has the usual drawback of a shallow embedding approach: it is tied 
to the host language, i.e. Danielsson's parsers must unavoidably be written in Agda. 

At the moment the same is true about our work: to use certified TRX, as presented in 
this paper, the grammar must be expressed in Coq. However, this is not a necessity with 
our approach, as we will sketch in a moment. The motivation for avoiding the need to use 
Coq is clear: this could make our certified parser technology usable for people outside of
the small community of theorem prover (Coq, in particular) experts.

As our work uses deep embedding of parsing expressions, it should be possible to turn
it into a generic parser generator. Doing so could be accomplished by bootstrapping TRX: it
should be possible to write a grammar in it that would synthesize a PEG in Coq (in our
format; Section 5.1) from its textual description. After this transformation the grammar could
be checked for well-formedness (with our generic procedure for checking well-formedness of
PEGs; Section 5.2), finally allowing parsing with this grammar (with our interpreter;
Section 5.3). This would result (via extraction) in a tool that would be capable of parsing
grammars expressed in a simple textual markup, hence removing any need to use or know
Coq for the users of such a tool.

The main difficulty with obtaining such a tool lies in the bootstrapping process. To do 
so we would need a kind of a higher-order grammar: a PEG formally describing its own 
syntax, that would take a textual description of a grammar and turn it into a PEG in our 
format. Such a grammar would need to have the type PExp (PExp (_)) and, as already 
hinted in Section 5.1, with our present encoding that would lead to universe inconsistency
problems. Also, our current use of module system precludes such use-case as modules are 
not first-class citizens in Coq and one cannot construct higher-order functors. 

But there is a more fundamental problem here: how do we synthesize semantic actions 
from their textual description? If the semantic actions were to be expressed in the calculus
of constructions of Coq, the way they are now, this seems to be futile. 
Let us step back a bit for a moment and consider a simpler problem: what if we
only wanted a recognizer, i.e., a parser that does not return any result, but only indicates
whether a given string is in the language described by the grammar or not. To address the
aforementioned problem with modules [Chr03] we could switch to type classes [SO08]
instead. Then we could build a generic recognizer as follows (pseudo-code):

Definition PEG_grammar : PExp pexp := ...

Program Definition do_parse (grammar : string) (input : string) :=
  match parse PEG_grammar grammar with
  | PR_ok _ peg => parse (promote peg) input
  | PR_fail => PR_fail
  end.

Here PEG_grammar is the grammar for PEGs. The main do_parse function takes two
arguments: grammar with the textual description of the grammar to use and input being
the input which we want to parse using the given grammar. We use PEG_grammar to
parse grammar and, hopefully, obtain its internal representation peg : pexp, in which case
we again invoke parse with promote peg as the grammar and input as the input string. Extracting
do_parse would give us a generic recognizer that could be used without Coq (or any
knowledge thereof).

Admittedly, in practice we are rarely interested in merely validating the input; usually 
we really want to parse it, obtaining its structural representation. How can the above
approach be extended to accommodate that and still result in a stand-alone tool, not requiring
interaction with Coq? 

One option would be to move from interpretation to code generation and then use
the target language to express semantic actions. An additional advantage is that this
should result in a big performance gain (compare the performance of TRX-gen and TRX-int in
Figure 8). But that would be a major undertaking requiring reasoning with respect to the
target language's semantics for the correctness proofs and some sort of (formally verified) 
termination analysis for that language, to ensure termination of the code of semantic actions 
(and hence the generated parser). 

The aforementioned termination problem for a parser generator could be simplified 
by restricting the code allowed in semantic actions to some subset of the target language, 
which is still expressive enough for this purpose but for which the termination analysis is 
simpler. For instance for a purely functional target language one could disallow recursion
altogether in productions (making termination evident), only allowing use of some predefined
set of combinators (to improve expressivity of semantic actions), which could be proven 
terminating manually. 

Another solution would be not to use semantic actions at all, but construct a parse
tree, the shape of which could be influenced by annotations in the grammar. This is the
approach used, for instance, in the OCaml PEG-based parser generator Aurochs [Dur09].
We believe this is a promising approach that we hope to explore in future work.

A completely different approach to developing a practical, certified parser generator would
be the standard technique of verification a posteriori: use an untrusted parser that, apart
from its result, generates some sort of a certificate (parse tree) and develop a (formally
correct) tool to verify, using the certificate, that the output of the tool (for a given input
and given grammar) is correct. The attractiveness of this approach lies in the fact that such
a verifier would typically be much simpler than the parser itself. There are two problems
with this approach though:

• this approach could at best give us partial correctness guarantees, as we would not be
able to ensure termination of the un-trusted parser (unless we also prove it in some way);

• if the parsing is successful it is relatively clear what a certificate should be (parse tree), but
what if it is not? How can we certify incorrectness of input with respect to the grammar?

Apart from making the certified TRX a Coq-independent, standalone tool and moving
from interpretation to code generation, we also identify a number of other possible
improvements to TRX as future work:

(1) Linear parsing time with PEGs can be ensured by using packrat parsing [For02b], i.e.,
enhancing the parser with memoization. This should be relatively easy to implement
(it has no impact on the termination argument and little impact on the correctness
argument for certified TRX), but induces high memory costs (and some performance
overhead), so it is not clear whether this would be beneficial. An alternative would be to develop
(formally verified?) tools to perform grammar analysis and warn the user in case the 
grammar can lead to exponential parsing times. 

(2) Another important aspect is that of left-recursive grammars, which occur naturally in 
practice. At the moment it is the responsibility of the user to eliminate left-recursion 
from a grammar. In the future, we plan to address this problem either by means
of left-recursion elimination [For02a], i.e., transforming a left-recursive grammar to
an equivalent one where left-recursion does not occur (this is not an easy problem in
the presence of semantic actions, especially if one also wants to allow mutually
left-recursive rules), or by an extension to the memoization technique that
allows dealing with left-recursive rules [WDM08].

(3) Finally, support for error messages, for instance following that of the PEG-based parser
generator Pappy [For02a], would greatly improve usability of TRX.
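
The memoization mentioned in item (1) can be illustrated by the following minimal sketch in
Python; it is not TRX code. The tuple encoding of expressions and the names make_parser,
parse and memo are assumptions of this illustration, and the grammar is assumed to be free
of left recursion. Results of rule invocations are cached per (rule, position) pair, so each
rule is evaluated at most once at each input position.

```python
from typing import Any, Dict, Optional, Tuple

# Hypothetical encoding of expressions as tuples:
#   ("lit", s), ("seq", e1, e2), ("alt", e1, e2), ("ref", rule_name)

def make_parser(grammar: Dict[str, Any], text: str):
    """Build a memoizing recognizer for text; returns (parse, memo)."""
    memo: Dict[Tuple[str, int], Optional[int]] = {}

    def parse(expr: Any, pos: int) -> Optional[int]:
        kind = expr[0]
        if kind == "lit":
            ok = text.startswith(expr[1], pos)
            return pos + len(expr[1]) if ok else None
        if kind == "seq":
            mid = parse(expr[1], pos)
            return None if mid is None else parse(expr[2], mid)
        if kind == "alt":
            # Prioritized choice: try the second branch only if the first fails.
            end = parse(expr[1], pos)
            return end if end is not None else parse(expr[2], pos)
        if kind == "ref":
            key = (expr[1], pos)
            if key not in memo:  # evaluate each (rule, position) only once
                memo[key] = parse(grammar[expr[1]], pos)
            return memo[key]
        raise ValueError(f"unknown expression kind: {kind!r}")

    return parse, memo

# A grammar that backtracks heavily on nested parentheses:
#   A <- P "+" A / P "-" A / P
#   P <- "(" A ")" / "a"
P = ("ref", "P")
A = ("ref", "A")
GRAMMAR = {
    "A": ("alt", ("seq", P, ("seq", ("lit", "+"), A)),
                 ("alt", ("seq", P, ("seq", ("lit", "-"), A)), P)),
    "P": ("alt", ("seq", ("lit", "("), ("seq", A, ("lit", ")"))),
                 ("lit", "a")),
}

text = "(" * 20 + "a" + ")" * 20
parse, memo = make_parser(GRAMMAR, text)
print(parse(A, 0) == len(text))  # prints True
```

Without the cache, P would be re-parsed up to three times at every nesting level of this
input. With it, the table holds at most one entry per rule and input position, which also
shows where the memory cost mentioned above comes from.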

9. Conclusions 

In this paper we described a Coq formalization of the theory of PEGs and, based on it, a 
formal development of TRX: a formally verified parser interpreter for PEGs. This allows 
us to write a PEG, together with its semantic actions, in Coq and then to extract from 
it a parser with total correctness guarantees. That means that the parser will terminate 
on all inputs and produce parsing results correct with respect to the semantics of PEGs. 
Considering the importance of parsing, this result appears as a first step towards a general 
way to bring added quality and security to all kinds of software.

The emphasis of our work was on practicality, so apart from treating this as an inter- 
esting academic exercise, we were aiming at obtaining a tool that scales and can be applied 
to real-life problems. We performed a case study with a (complete) Java grammar and 
demonstrated that the resulting parser exhibits a reasonable performance. We also stressed 
the importance of making those results available to people outside of the small circle of 
theorem-proving experts and presented a plan for doing so as future work.